Data AnalyticsData Analysis using Python - Java, Java Swing, OOAD, MIS, DSA

Data Analytics

Data Analysis using Python

Given a DataFrame named df with columns City, Sales, and Quarter, write the pandas code to filter and display only the rows where the City is "New York" and Sales are greater than 50,000.

# Using boolean indexing with the & (AND) operator
filtered_df = df[(df['City'] == 'New York') & (df['Sales'] > 50000)]
print(filtered_df)

Create a DataFrame sales with columns: Product, Region, Sales, Profit. Add 5 rows of sample data. Display the first 3 rows.

import pandas as pd

data = {
    'Product': ['A', 'B', 'C', 'A', 'B'],
    'Region': ['North', 'South', 'North', 'East', 'South'],
    'Sales': [100, 200, 150, 300, 250],
    'Profit': [20, 40, 30, 60, 50]
}

sales = pd.DataFrame(data)
print(sales.head(3))

From the sales DataFrame, filter rows where Sales > 150 and Region == 'South'.

    filtered = sales[(sales['Sales'] > 150) & (sales['Region'] == 'South')]
print(filtered)

Find the mean, sum, and standard deviation of Sales column.

    mean_sales = sales['Sales'].mean()
sum_sales = sales['Sales'].sum()
std_sales = sales['Sales'].std()

print("Mean Sales:", mean_sales)
print("Sum Sales:", sum_sales)
print("Std Dev Sales:", std_sales)

For the Sales column, calculate a 2-period moving average.

    sales['Sales_MA2'] = sales['Sales'].rolling(window=2).mean()
print(sales)

You have a DataFrame df tracking daily stock prices in a column named Close_Price.

# 1. Finding the median
median_price = df['Close_Price'].median()

# 2. Calculating the 7-day moving average
df['7_Day_MA'] = df['Close_Price'].rolling(window=7).mean()

Using Sales and Profit columns, apply KMeans with 2 clusters.

    from sklearn.cluster import KMeans

X = sales[['Sales', 'Profit']]
kmeans = KMeans(n_clusters=2, random_state=0)
sales['Cluster'] = kmeans.fit_predict(X)

print(sales)

You want to group your customers into 4 distinct segments using K-Means based on their Age and Spending_Score. Assuming your data is already scaled and stored in a variable X, write the scikit-learn code to initialize, fit, and predict these clusters.

from sklearn.cluster import KMeans # Initialize the model with 4 clusters kmeans = KMeans(n_clusters=4, random_state=42) # Fit the model and predict the cluster labels cluster_labels = kmeans.fit_predict(X) # (Optional) Save back to your original dataframe # df['Cluster'] = cluster_labels

You are building a Decision Tree to predict whether a customer will "Churn" (Yes/No). Your features are in X and target is in y.

Initialize a Decision Tree with a maximum depth of 5.

Fit the model to your data.

from sklearn.tree import DecisionTreeClassifier

# 1. Initialize with max_depth limit to prevent overfitting
dt_classifier = DecisionTreeClassifier(max_depth=5, random_state=42)

# 2. Fit the model
dt_classifier.fit(X, y)

In Market Basket Analysis, if a store has 1,000 total transactions, and Bread is bought in 200 of them, while Milk and Bread are bought together in 50 transactions:

What is the Support of Bread?

What is the Confidence of the rule Bread-->Milk?

You have a raw dataset of employee data. You want to see the total (sum) salary paid out, broken down by Department (as rows) and Gender (as columns). How would you set this up?

    pivot_df = df.pivot_table(values='Salary', index='Department', columns='Gender', aggfunc='sum')

You have two sheets/DataFrames:

Orders: Contains Order_ID and Customer_ID.

Customers: Contains Customer_ID and Customer_Name.
You want to bring the Customer_Name into the Orders table. If you were doing this in Pandas, what function would you use?

# Merging the two dataframes on the common column 'Customer_ID'
updated_orders = orders.merge(customers[['Customer_ID', 'Customer_Name']], on='Customer_ID', how='left')

Using Sales and Profit as features, predict Region (convert to numeric first).

    from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
sales['Region_Code'] = le.fit_transform(sales['Region'])

X = sales[['Sales', 'Profit']]
y = sales['Region_Code']

model = DecisionTreeClassifier()
model.fit(X, y)

# Predict for a new sale (Sales=200, Profit=45)
print(model.predict([[200, 45]]))  # Output: encoded region

Given a transaction dataset, find frequent itemsets with min support = 0.5.

    from mlxtend.frequent_patterns import apriori, association_rules

# Sample transaction data in one-hot format
data = {'Milk': [1, 0, 1, 1],
        'Bread': [1, 1, 0, 1],
        'Butter': [0, 1, 1, 0]}

df = pd.DataFrame(data)

frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1)

print(frequent)
print(rules)

Create a pivot table showing total Sales per Region and Product.

    pivot = pd.pivot_table(sales, values='Sales', index='Region', columns='Product', aggfunc='sum')
print(pivot)

You have two DataFrames: sales (Product, Sales) and prices (Product, Price). Add Price to sales DataFrame using Product as key.

    prices = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [10, 15, 12]
})

sales_with_price = pd.merge(sales, prices, on='Product', how='left')
print(sales_with_price)

What is a DataFrame?
A two-dimensional table in Pandas.

Which function displays the first 5 rows?
df.head()

Which function calculates average?
mean()

What is a moving average?
Average of recent observations over a window.

What does merge() do?
Combines two DataFrames using a common column.

Is K-Means supervised or unsupervised?
Unsupervised.

What is the purpose of a Decision Tree?
Classification or prediction.

What does Apriori find?
Frequent itemsets and associations.

What is a pivot table?
A summary table for aggregation.

Which function creates a pivot table?
pd.pivot_table()

Online-Academy

Look, Read, Understand, Apply

Data Analytics